Abstract In this paper, we are concerned with regression problems where
covariates can be grouped in nonoverlapping blocks, and where only a few of
them are assumed to be active. In such a situation, the group Lasso is an
attractive method for variable selection since it promotes sparsity of the groups.
We study the sensitivity of any group Lasso solution to the observations and
provide its precise local parameterization. When the noise is Gaussian, this
allows us to derive an unbiased estimator of the degrees of freedom of the
group Lasso. This result holds true for any fixed design, no matter whether it
is under- or overdetermined. With these results at hand, various model
selection criteria, such as the Stein Unbiased Risk Estimator (SURE), are readily
available which can provide an objectively guided choice of the optimal group
Lasso fit.
1 Introduction



1.1 Group Lasso 



Consider the linear regression problem 

$y = X\beta_0 + \varepsilon$, (1)

where $y \in \mathbb{R}^n$ is the response vector, $\beta_0 \in \mathbb{R}^p$ is the unknown vector of
regression coefficients to be estimated, $X \in \mathbb{R}^{n \times p}$ is the design matrix whose
columns are the $p$ covariate vectors, and $\varepsilon$ is the error term. In this paper, we
do not make any specific assumption on the number of observations $n$ with
respect to the number of predictors $p$. Recall that when $n < p$, (1) is an
underdetermined linear regression model, whereas when $n \geq p$ and all the columns
of $X$ are linearly independent, it is overdetermined.



Regularization is now a central theme in many fields including statistics,
machine learning and inverse problems. It reduces the space of candidate
solutions by imposing some prior structure on the object to be estimated.
This regularization ranges from squared Euclidean or Hilbertian norms
(Tikhonov and Arsenin 1997) to non-Hilbertian norms that have sparked
considerable interest in recent years. Of particular interest are
sparsity-inducing regularizations such as the $\ell^1$ norm, which is an intensively
active area of research, e.g. (Tibshirani 1996; Osborne et al 2000; Donoho 2006;
Candès and Plan 2009; Bickel et al 2009); see (Bühlmann and van de Geer 2011)
for a comprehensive review. When the covariates are assumed to be clustered
in a few active groups/blocks, the group Lasso has been advocated since it
promotes sparsity of the groups, i.e. it drives all the coefficients in one group to
zero together, hence leading to group selection; see (Bakin 1999; Yuan and Lin
2006; Bach 2008; Wei and Huang 2010) to cite a few.



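Let $\mathcal{B}$ be a disjoint union of the set of indices, i.e. $\bigcup_{b \in \mathcal{B}} b = \{1, \ldots, p\}$ such
that for all $b, b' \in \mathcal{B}$ with $b \neq b'$, $b \cap b' = \emptyset$. For $\beta \in \mathbb{R}^p$ and each $b \in \mathcal{B}$,
$\beta_b = (\beta_i)_{i \in b}$ is the subvector of $\beta$ whose entries are indexed by the block $b$,
and $|b|$ is the cardinality of $b$. The group Lasso amounts to solving

$\hat\beta(y) \in \operatorname{argmin}_{\beta \in \mathbb{R}^p} \frac{1}{2}\|y - X\beta\|^2 + \lambda \sum_{b \in \mathcal{B}} \|\beta_b\|$, ($\mathcal{P}_\lambda(y)$)

where $\lambda > 0$ is the regularization parameter and $\|\cdot\|$ is the (Euclidean)
$\ell^2$-norm. By coercivity of the penalty norm, the set of minimizers of ($\mathcal{P}_\lambda(y)$) is a
nonempty convex compact set. Note that the Lasso is a particular instance of
($\mathcal{P}_\lambda(y)$) that is recovered when each block $b$ is of size 1.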
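As an illustration of the problem just stated, the following is a minimal sketch of one standard way to compute a minimizer numerically, namely proximal gradient (ISTA) iterations with block soft thresholding. The paper does not prescribe a solver; the function names, the step size rule and the random test data are assumptions made for this example only.

```python
import numpy as np

def block_soft_threshold(v, blocks, tau):
    """Shrink the Euclidean norm of each block v_b by tau (zero if ||v_b|| <= tau)."""
    out = np.zeros_like(v)
    for b in blocks:
        norm_b = np.linalg.norm(v[b])
        if norm_b > tau:
            out[b] = (1.0 - tau / norm_b) * v[b]
    return out

def objective(X, y, beta, blocks, lam):
    """Group Lasso objective: 0.5 * ||y - X beta||^2 + lam * sum_b ||beta_b||."""
    penalty = sum(np.linalg.norm(beta[b]) for b in blocks)
    return 0.5 * np.sum((y - X @ beta) ** 2) + lam * penalty

def group_lasso_ista(X, y, blocks, lam, n_iter=2000):
    """Proximal gradient iterations (an assumed solver, not taken from the paper)."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2      # inverse Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)             # gradient of the quadratic term
        beta = block_soft_threshold(beta - step * grad, blocks, step * lam)
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 12))                # n = 8 < p = 12: underdetermined design
y = rng.standard_normal(8)
blocks = [list(range(i, i + 3)) for i in range(0, 12, 3)]
beta_hat = group_lasso_ista(X, y, blocks, lam=1.0)
```

With blocks of size 1 the same iterations reduce to the usual soft-thresholding scheme for the Lasso.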



1.2 Degrees of Freedom 

We focus in this paper on sensitivity analysis of any solution to ($\mathcal{P}_\lambda(y)$) with
respect to the observations $y$ and the regularization parameter $\lambda$. This turns
out to be a central ingredient to compute an estimator of the degrees of freedom
(DOF) of the group Lasso response. The DOF is usually used to quantify the
complexity of a statistical modeling procedure (Efron 1986).

More precisely, let $\hat\mu(y) = X\hat\beta(y)$ be the response or the prediction
associated to an estimator $\hat\beta(y)$ of $\beta_0$, and let $\mu_0 = X\beta_0$. We recall that $\hat\mu(y)$ is
always uniquely defined (see Lemma 2), although $\hat\beta(y)$ may not be, as is the case
when $X$ is a rank-deficient or underdetermined design matrix. Suppose that $\varepsilon$
is an additive white Gaussian noise $\varepsilon \sim \mathcal{N}(0, \sigma^2 \mathrm{Id}_n)$. Following (Efron 1986),
the DOF is given by



$df = \sum_{i=1}^{n} \frac{\mathrm{cov}(y_i, \hat\mu_i(y))}{\sigma^2}$.



The well-known Stein's lemma asserts that, if $y \mapsto \hat\mu(y)$ is a weakly differentiable
function for which

$\mathbb{E}_\varepsilon\left(\left|\frac{\partial \hat\mu_i(y)}{\partial y_i}\right|\right) < \infty$,

then its divergence is an unbiased estimator of its DOF, i.e.

$\hat{df} = \mathrm{div}\,\hat\mu(y) = \mathrm{tr}(\partial_y \hat\mu(y))$ and $\mathbb{E}_\varepsilon(\hat{df}) = df$,

where $\partial_y \hat\mu(y)$ is the Jacobian of $\hat\mu(y)$. It is well known that in Gaussian
regression problems, an unbiased estimator of the DOF yields an unbiased
estimate of the prediction risk $\mathbb{E}_\varepsilon\|\hat\mu(y) - \mu_0\|^2$ through e.g. Mallows' $C_p$
(Mallows 1973), the AIC (Akaike 1973) or the SURE (Stein Unbiased Risk
Estimate, Stein 1981). These quantities can serve as model selection criteria
to assess the accuracy of a candidate model.
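The link between the DOF and the prediction risk can be made explicit in a few lines. The following is a sketch of the standard argument under the Gaussian model (1), assuming all expectations are finite; it uses only $y = \mu_0 + \varepsilon$, $\mathbb{E}_\varepsilon(\varepsilon) = 0$ and $\mathbb{E}_\varepsilon\|\varepsilon\|^2 = n\sigma^2$.

```latex
\begin{align*}
\|\hat\mu(y) - \mu_0\|^2
  &= \|y - \hat\mu(y)\|^2 - \|\varepsilon\|^2
     + 2\langle \varepsilon, \hat\mu(y) - \mu_0 \rangle, \\
\mathbb{E}_\varepsilon \|\hat\mu(y) - \mu_0\|^2
  &= \mathbb{E}_\varepsilon \|y - \hat\mu(y)\|^2 - n\sigma^2
     + 2 \sum_{i=1}^{n} \mathrm{cov}(y_i, \hat\mu_i(y)) \\
  &= \mathbb{E}_\varepsilon \|y - \hat\mu(y)\|^2 - n\sigma^2 + 2\sigma^2\, df .
\end{align*}
```

Replacing $df$ by any unbiased estimate of it therefore gives an unbiased estimate of the prediction risk.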



1.3 Contributions 

This paper establishes a general result (Theorem 1) on local parameterization
of any solution to the group Lasso ($\mathcal{P}_\lambda(y)$) as a function of the observation
vector $y$. This local behavior result does not need $X$ to be full column rank.
With such a result at hand, we derive an expression of the divergence of the
group Lasso response. Using tools from semialgebraic geometry, we prove that
this divergence formula is valid Lebesgue-almost everywhere (Theorem 2), and
thus, this formula is a provably unbiased estimate of the DOF (Theorem 3).
In turn, this allows us to deduce an unbiased estimate of the prediction risk
of the group Lasso through the SURE.



1.4 Relation to prior works 



In the special case of standard Lasso with a linearly independent design,
(Zou et al 2007) show that the number of nonzero coefficients is an unbiased
estimate for the degrees of freedom. This work is generalized in (Dossal et al
2012) to any arbitrary design matrix. The DOF of the analysis sparse
regularization (a.k.a. generalized Lasso in statistics) is studied in (Tibshirani and Taylor
2012; Vaiter et al 2012b).



A formula of an estimate of the DOF for the group Lasso when the design
is orthogonal within each group is conjectured in (Yuan and Lin 2006).
Its unbiasedness is proved but only for an orthogonal design. (Kato 2009)
studies the DOF of a general shrinkage estimator where the regression
coefficients are constrained to a closed convex set $C$. This work extended that of
(Meyer and Woodroofe 2000) which treats the case where $C$ is a convex
polyhedral cone. When $X$ is full column rank, (Kato 2009) derived a divergence
formula under a smoothness condition on the boundary of $C$, from which he
obtained an unbiased estimator of the degrees of freedom. When specializing
to the constrained version of the group Lasso, the author provided an unbiased
estimate of the corresponding DOF under the same group-wise orthogonality
assumption on $X$ as (Yuan and Lin 2006). An estimate of the DOF for the
group Lasso is also given by (Solo and Ulfarsson 2010) using heuristic
derivations that are valid only when $X$ is full column rank, though its unbiasedness
is not proved.

In (Vaiter et al 2012a), we derived an estimator of the DOF of the group
Lasso and proved its unbiasedness when $X$ is full column rank, but without
the orthogonality assumption required in (Yuan and Lin 2006; Kato 2009).
In this paper, we remove the full column rank assumption, which enables us
to tackle the much more challenging rank-deficient or underdetermined case
where $p > n$.



1.5 Notations 

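We start with some notations used in the rest of the paper. We extend the notion
of support, commonly used in sparsity, by defining the $\mathcal{B}$-support $\mathrm{supp}_{\mathcal{B}}(\beta)$ of
$\beta \in \mathbb{R}^p$ as

$\mathrm{supp}_{\mathcal{B}}(\beta) = \{b \in \mathcal{B} : \|\beta_b\| \neq 0\}$.

The size of $\mathrm{supp}_{\mathcal{B}}(\beta)$ is defined as $|\mathrm{supp}_{\mathcal{B}}(\beta)| = \sum_{b \in \mathrm{supp}_{\mathcal{B}}(\beta)} |b|$. The set of all
$\mathcal{B}$-supports is denoted $\mathcal{I}$. We denote by $X_I$, where $I$ is a $\mathcal{B}$-support, the matrix
formed by the columns $X_i$ where $i$ is an element of $b \in I$. To lighten the
notation in our derivations, we introduce the following block-diagonal operators

$\delta_\beta : v \in \mathbb{R}^{|I|} \mapsto (v_b / \|\beta_b\|)_{b \in I} \in \mathbb{R}^{|I|}$
and $P_\beta : v \in \mathbb{R}^{|I|} \mapsto (\mathrm{Proj}_{\beta_b^\perp}(v_b))_{b \in I} \in \mathbb{R}^{|I|}$,

where $\mathrm{Proj}_{\beta_b^\perp} = \mathrm{Id} - \beta_b \beta_b^T / \|\beta_b\|^2$ is the orthogonal projector on $\beta_b^\perp$. For any matrix
$A$, $A^T$ denotes its transpose.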
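To make these operators concrete, here is a small sketch that assembles the matrix of the composition $\delta_\beta \circ P_\beta$ on $\mathbb{R}^{|I|}$, with the blocks of the support concatenated in order. The helper name `delta_P_matrix` and the example vector are hypothetical and only serve the illustration.

```python
import numpy as np

def delta_P_matrix(beta, support_blocks):
    """Matrix of delta_beta o P_beta acting on R^{|I|} (blocks concatenated in order)."""
    size = sum(len(b) for b in support_blocks)
    M = np.zeros((size, size))
    start = 0
    for b in support_blocks:
        beta_b = beta[b]
        norm_b = np.linalg.norm(beta_b)
        k = len(b)
        # orthogonal projector on the orthogonal complement of beta_b
        proj = np.eye(k) - np.outer(beta_b, beta_b) / norm_b**2
        # delta_beta divides the block by ||beta_b||
        M[start:start + k, start:start + k] = proj / norm_b
        start += k
    return M

beta = np.array([3.0, 4.0, 0.0, 0.0, 1.0, 0.0])
blocks = [[0, 1], [2, 3], [4, 5]]
# B-support: blocks on which beta does not vanish
support = [b for b in blocks if np.linalg.norm(beta[b]) != 0]
M = delta_P_matrix(beta, support)
beta_I = np.concatenate([beta[b] for b in support])
```

The resulting matrix is symmetric positive semidefinite and annihilates $\beta_I$, which is the structure exploited later in Lemma 3.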
1.6 Paper organization

The paper is organized as follows. Sensitivity analysis of the group Lasso
solutions to perturbations of the observations is given in Section 2. Then we turn to
the degrees of freedom and unbiased prediction risk estimation in Section 3.
The proofs are deferred to Section 4 awaiting inspection by the interested
reader.

2 Local Behavior of the Group Lasso 

The first difficulty we need to overcome when $X$ is not full column rank is
that $\hat\beta(y)$ is not uniquely defined. Toward this goal, we are led to impose the
following assumption on $X$ with respect to the block structure.

Assumption ($A(\beta)$): Given a vector $\beta \in \mathbb{R}^p$ of $\mathcal{B}$-support $I$, we assume that
the finite subset of vectors $\{X_b \beta_b : b \in I\}$ is linearly independent.

It is important to notice that ($A(\beta)$) is weaker than imposing that $X_I$ is
full column rank, which is standard when analyzing the Lasso. The two
assumptions coincide for the Lasso, i.e. $|b| = 1, \forall b \in I$.
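The difference between the two assumptions can be checked numerically. The sketch below tests ($A(\beta)$) through the rank of the matrix whose columns are the vectors $X_b\beta_b$; the function name `assumption_holds`, the tolerance and the small design are assumptions of this example.

```python
import numpy as np

def assumption_holds(X, beta, blocks, tol=1e-10):
    """Check (A(beta)): the vectors X_b beta_b, b in the B-support, are linearly independent."""
    support = [b for b in blocks if np.linalg.norm(beta[b]) > 0]
    if not support:
        return True  # an empty family is linearly independent
    V = np.column_stack([X[:, b] @ beta[b] for b in support])
    return bool(np.linalg.matrix_rank(V, tol=tol) == len(support))

# n = 2, p = 4: X_I cannot be full column rank when both blocks are active
X = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])
blocks = [[0, 1], [2, 3]]
ok = assumption_holds(X, np.array([1.0, 0.0, 0.0, 1.0]), blocks)    # X_b beta_b = (1,0), (0,1)
bad = assumption_holds(X, np.array([1.0, 0.0, 1.0, 0.0]), blocks)   # X_b beta_b = (1,0), (1,0)
```

In the first case the assumption holds although the four active columns of $X$ are linearly dependent; in the second case it fails.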

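Let us now turn to sensitivity of the minimizers $\hat\beta(y)$ of ($\mathcal{P}_\lambda(y)$) to
perturbations of $y$. Toward this end, we will exploit the fact that $\hat\beta(y)$ obeys an
implicit parameterization. But as optimal solutions turn out to be not
everywhere differentiable, we will concentrate on a local analysis where $y$ is allowed
to vary in a neighborhood where non-differentiability will not occur. This is
why we need to introduce the following transition space $\mathcal{H}$.

Definition 1 Let $\lambda > 0$. The transition space $\mathcal{H}$ is defined as

$\mathcal{H} = \bigcup_{I \subset \mathcal{B}} \bigcup_{b \notin I} \mathcal{H}_{I,b}$, where $\mathcal{H}_{I,b} = \mathrm{bd}(\pi(\mathcal{A}_{I,b}))$,

where we have denoted

$\pi : \mathbb{R}^n \times \mathbb{R}^{I,*} \times \mathbb{R}^{I,*} \to \mathbb{R}^n$ where $\mathbb{R}^{I,*} = \prod_{b \in I} (\mathbb{R}^{|b|} \setminus \{0\})$

the canonical projection on $\mathbb{R}^n$ (with respect to the first component), $\mathrm{bd}\,C$ is
the boundary of the set $C$, and

$\mathcal{A}_{I,b} = \{(y, \beta_I, v_I) \in \mathbb{R}^n \times \mathbb{R}^{I,*} \times \mathbb{R}^{I,*} : \|X_b^T(y - X_I\beta_I)\| = \lambda, \; X_I^T(X_I\beta_I - y) + \lambda v_I = 0, \; v_I = N(\beta_I)\}$,

with $N(\beta_I) = (\beta_b / \|\beta_b\|)_{b \in I}$ the normalization operator used in Section 4.

We are now equipped to state our main sensitivity analysis result.

Theorem 1 Let $\lambda > 0$. Let $y \notin \mathcal{H}$, and $\hat\beta(y)$ a solution of ($\mathcal{P}_\lambda(y)$). Let $I =
\mathrm{supp}_{\mathcal{B}}(\hat\beta(y))$ be the $\mathcal{B}$-support of $\hat\beta(y)$ such that ($A(\hat\beta(y))$) holds. Then, there
exists an open neighborhood $\mathcal{O} \subset \mathbb{R}^n$ of $y$, and a mapping $\tilde\beta : \mathcal{O} \to \mathbb{R}^p$ such
that

1. For all $\bar y \in \mathcal{O}$, $\tilde\beta(\bar y)$ is a solution of ($\mathcal{P}_\lambda(\bar y)$), and $\tilde\beta(y) = \hat\beta(y)$.

2. The $\mathcal{B}$-support of $\tilde\beta(\bar y)$ is constant on $\mathcal{O}$, i.e.

$\forall \bar y \in \mathcal{O}, \; \mathrm{supp}_{\mathcal{B}}(\tilde\beta(\bar y)) = I$.

3. The mapping $\tilde\beta$ is $C^1(\mathcal{O})$ and its Jacobian is such that $\forall \bar y \in \mathcal{O}$,

$\partial_y \tilde\beta_{I^c}(\bar y) = 0$ and $\partial_y \tilde\beta_I(\bar y) = d(\bar y, \lambda)$ (2)
where $d(\bar y, \lambda) = (X_I^T X_I + \lambda\, \delta_{\tilde\beta(\bar y)} \circ P_{\tilde\beta(\bar y)})^{-1} X_I^T$ (3)
and $I^c = \{b \in \mathcal{B} : b \notin I\}$. (4)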
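As a concrete reading of (3), the sketch below assembles the matrix $d(y, \lambda)$ from a given solution. The helper name `jacobian_d` is hypothetical, the function assumes a nonempty support on which the matrix to invert is nonsingular, and the example uses $X = \mathrm{Id}_4$, where the solution is known in closed form.

```python
import numpy as np

def jacobian_d(X, beta_hat, blocks, lam):
    """Return (cols, d) with d = (X_I^T X_I + lam * delta o P)^{-1} X_I^T, cf. (3).

    Assumes beta_hat has a nonempty B-support.
    """
    support = [b for b in blocks if np.linalg.norm(beta_hat[b]) > 0]
    cols = [i for b in support for i in b]          # column indices of X_I
    X_I = X[:, cols]
    DP = np.zeros((len(cols), len(cols)))           # matrix of delta o P
    start = 0
    for b in support:
        beta_b, k = beta_hat[b], len(b)
        nb = np.linalg.norm(beta_b)
        DP[start:start + k, start:start + k] = (np.eye(k) - np.outer(beta_b, beta_b) / nb**2) / nb
        start += k
    d = np.linalg.solve(X_I.T @ X_I + lam * DP, X_I.T)
    return cols, d

X = np.eye(4)
blocks = [[0, 1], [2, 3]]
# block soft thresholding of y = (3, 4, 0.1, 0.2) at lam = 1
beta_hat = np.array([2.4, 3.2, 0.0, 0.0])
cols, d = jacobian_d(X, beta_hat, blocks, lam=1.0)
trace_value = float(np.trace(X[:, cols] @ d))       # tr(X_I d)
```

Here only the first block is active, and the rows of `d` give the sensitivity of its two coefficients to each entry of $y$.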



3 Degrees of freedom and Risk Estimation 

As remarked earlier and stated formally in Lemma 2, all solutions of the group Lasso
share the same image under $X$, hence allowing us to denote the prediction $\hat\mu(y)$
without ambiguity as a single-valued mapping. The next theorem provides
a closed-form expression of the local variations of $\hat\mu(y)$ with respect to the
observation $y$. In turn, this will yield an unbiased estimator of the degrees of
freedom and of the prediction risk of the group Lasso.

Theorem 2 Let $\lambda > 0$. For all $y \notin \mathcal{H}$, there exists a solution $\hat\beta(y)$ of ($\mathcal{P}_\lambda(y)$)
with $\mathcal{B}$-support $I = \mathrm{supp}_{\mathcal{B}}(\hat\beta(y))$ such that ($A(\hat\beta(y))$) is fulfilled. Moreover,
the mapping $y \mapsto \hat\mu(y) = X\hat\beta(y)$ is $C^1(\mathbb{R}^n \setminus \mathcal{H})$ and,

$\mathrm{div}(\hat\mu(y)) = \mathrm{tr}(X_I\, d(y, \lambda))$ (5)
where $\hat\beta(y)$ is such that ($A(\hat\beta(y))$) holds.

Theorem 3 Let $\lambda > 0$. Assume $y = X\beta_0 + \varepsilon$ where $\varepsilon \sim \mathcal{N}(0, \sigma^2 \mathrm{Id}_n)$. The set
$\mathcal{H}$ has Lebesgue measure zero, and therefore (5) is an unbiased estimate of the
DOF of the group Lasso. Moreover, an unbiased estimator of the prediction
risk $\mathbb{E}_\varepsilon\|\hat\mu(y) - \mu_0\|^2$ is given by the SURE formula

$\mathrm{SURE}(\hat\mu(y)) = \|y - \hat\mu(y)\|^2 - n\sigma^2 + 2\sigma^2\, \mathrm{tr}(X_I\, d(y, \lambda))$. (6)
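Once an estimate of the DOF is available, the SURE formula is a one-line computation. The sketch below evaluates it for given inputs; the function name `sure` and the numerical values (a fit obtained for $X = \mathrm{Id}_4$, $\lambda = 1$, and an assumed known noise level $\sigma = 0.5$) are illustrative assumptions.

```python
import numpy as np

def sure(y, mu_hat, df_hat, sigma):
    """SURE formula: ||y - mu_hat||^2 - n * sigma^2 + 2 * sigma^2 * df_hat."""
    n = y.size
    return float(np.sum((y - mu_hat) ** 2) - n * sigma**2 + 2 * sigma**2 * df_hat)

y = np.array([3.0, 4.0, 0.1, 0.2])
mu_hat = np.array([2.4, 3.2, 0.0, 0.0])   # group Lasso response for X = Id, lam = 1
df_hat = 1.8                              # tr(X_I d(y, lam)) for this fit
value = sure(y, mu_hat, df_hat, sigma=0.5)
```

In practice one would evaluate this quantity over a grid of values of $\lambda$ and keep the minimizer, which is the model selection use mentioned in the introduction.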

Although not given here explicitly, Theorem 3 can be straightforwardly
extended to unbiasedly estimate other measures of the risk, including the projection risk,
or the estimation risk (in the full rank case) through the Generalized Stein
Unbiased Risk Estimator as proposed in (Vaiter et al 2012b).

An immediate corollary of Theorem 3 is obtained when $X$ is orthogonal,
and without loss of generality $X = \mathrm{Id}_n$, i.e. $\hat\mu(y)$ is the block soft thresholding
estimator. We then recover the expression found by (Yuan and Lin 2006).

Corollary 1 If $X = \mathrm{Id}_n$, then

$\hat{df} = |I| - \lambda \sum_{b \in I} \frac{|b| - 1}{\|y_b\|}$

where $I = \bigcup \{b \in \mathcal{B} : \|y_b\| > \lambda\}$. Moreover, the SURE is given by

$\mathrm{SURE}(\hat\mu(y)) = -n\sigma^2 + (2\sigma^2 + \lambda^2)|I| + \sum_{b \notin I} \|y_b\|^2 - 2\sigma^2\lambda \sum_{b \in I} \frac{|b| - 1}{\|y_b\|}$.
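The closed-form DOF of the corollary can be compared with a direct numerical divergence of the block soft thresholding estimator. The sketch below does so with central finite differences; the function names, the step `h` and the test vector are assumptions of this example, and the check is only meaningful away from the points where $\|y_b\| = \lambda$.

```python
import numpy as np

def block_soft_threshold(y, blocks, lam):
    """Group Lasso response for X = Id: shrink each block norm by lam."""
    mu = np.zeros_like(y)
    for b in blocks:
        nb = np.linalg.norm(y[b])
        if nb > lam:
            mu[b] = (1.0 - lam / nb) * y[b]
    return mu

def df_closed_form(y, blocks, lam):
    """|I| - lam * sum over active blocks of (|b| - 1) / ||y_b||."""
    active = [b for b in blocks if np.linalg.norm(y[b]) > lam]
    size_I = sum(len(b) for b in active)
    return size_I - lam * sum((len(b) - 1) / np.linalg.norm(y[b]) for b in active)

def divergence_fd(y, blocks, lam, h=1e-6):
    """Central finite differences of sum_i d mu_i / d y_i."""
    div = 0.0
    for i in range(y.size):
        e = np.zeros_like(y)
        e[i] = h
        plus = block_soft_threshold(y + e, blocks, lam)[i]
        minus = block_soft_threshold(y - e, blocks, lam)[i]
        div += (plus - minus) / (2 * h)
    return div

y = np.array([3.0, 4.0, 0.1, 0.2, -2.0, 1.0, 2.0])
blocks = [[0, 1], [2, 3], [4, 5, 6]]
lam = 1.0
df_exact = df_closed_form(y, blocks, lam)
df_numeric = divergence_fd(y, blocks, lam)
```

Both quantities agree up to discretization error, and the DOF is strictly smaller than the number of active coefficients as soon as an active block has size larger than 1.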



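We finally quantify the (relative) reliability of the SURE by computing the
expected squared-error between $\mathrm{SURE}(\hat\mu(y))$ and the true squared-error

$\mathrm{SE}(\hat\mu(y)) = \|\hat\mu(y) - \mu_0\|^2$.

Proposition 1 Under the assumptions of Theorem 3, the relative reliability
obeys

$\mathbb{E}_\varepsilon\left[\frac{(\mathrm{SURE}(\hat\mu(y)) - \mathrm{SE}(\hat\mu(y)))^2}{n^2\sigma^4}\right] \leq \frac{18 + 4\,\mathbb{E}_\varepsilon(\|U_I\|^2)}{n} + \frac{8\|\mu_0\|^2}{n^2\sigma^2}$

where $U_I = X_I^T X_I (X_I^T X_I + \lambda\, \delta_{\hat\beta(y)} \circ P_{\hat\beta(y)})^{-1}$.
In particular, it decays at the rate $O(1/n)$ if $\mathbb{E}_\varepsilon(\|U_I\|^2) = O(1)$.

Note that when $X = \mathrm{Id}_n$, the proof of Corollary 1 yields that $\|U_I\| = 1$.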



4 Proofs 

This section details the proofs of our results. For a vector $\beta$ whose $\mathcal{B}$-support
is $I$, we introduce the following normalization operator

$N(\beta_I) = v_I$ where $\forall b \in I, \; v_b = \frac{\beta_b}{\|\beta_b\|}$.



4.1 Preparatory lemmata 

By standard arguments of convex analysis and using the subdifferential of the
group Lasso $\ell^1 - \ell^2$ penalty, the following lemma gives the first-order sufficient
and necessary optimality condition of a minimizer of ($\mathcal{P}_\lambda(y)$); see e.g. Bach
(2008).

Lemma 1 A vector $\beta^\star \in \mathbb{R}^p$ is a solution of ($\mathcal{P}_\lambda(y)$) if, and only if the
following holds

1. On the $\mathcal{B}$-support $I = \mathrm{supp}_{\mathcal{B}}(\beta^\star)$,

$X_I^T(y - X_I\beta^\star_I) = \lambda N(\beta^\star_I)$.

2. For all $b \in \mathcal{B}$ such that $b \notin I$, one has

$\|X_b^T(y - X_I\beta^\star_I)\| \leq \lambda$.
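The two conditions of the lemma translate directly into a numerical test for a candidate vector. The sketch below is one way to write it; the function name `is_group_lasso_solution` and the tolerance are assumptions, and the example uses $X = \mathrm{Id}_4$ so that the solution is known by block soft thresholding.

```python
import numpy as np

def is_group_lasso_solution(X, y, beta, blocks, lam, tol=1e-8):
    """Check the two first-order conditions up to a numerical tolerance."""
    # X^T (y - X beta) equals X^T (y - X_I beta_I) because beta vanishes off its support
    corr = X.T @ (y - X @ beta)
    for b in blocks:
        nb = np.linalg.norm(beta[b])
        if nb > 0:
            # condition 1: X_b^T (y - X_I beta_I) = lam * beta_b / ||beta_b||
            if np.linalg.norm(corr[b] - lam * beta[b] / nb) > tol:
                return False
        else:
            # condition 2: ||X_b^T (y - X_I beta_I)|| <= lam
            if np.linalg.norm(corr[b]) > lam + tol:
                return False
    return True

X = np.eye(4)
y = np.array([3.0, 4.0, 0.1, 0.2])
blocks = [[0, 1], [2, 3]]
solution = np.array([2.4, 3.2, 0.0, 0.0])   # block soft thresholding of y at lam = 1
check_solution = is_group_lasso_solution(X, y, solution, blocks, lam=1.0)
check_other = is_group_lasso_solution(X, y, y.copy(), blocks, lam=1.0)
```

Such a test is convenient for validating the output of any iterative solver before using it in the DOF formula.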

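We now show that all solutions of ($\mathcal{P}_\lambda(y)$) share the same image under the
action of $X$, which in turn implies that the prediction/response vector $\hat\mu$ is a
single-valued mapping of $y$.

Lemma 2 If $\beta^0$ and $\beta^1$ are two solutions of ($\mathcal{P}_\lambda(y)$), then $X\beta^0 = X\beta^1$.

Proof Let $\beta^0, \beta^1$ be two solutions of ($\mathcal{P}_\lambda(y)$) such that $X\beta^0 \neq X\beta^1$. Take
any convex combination $\beta^\rho = (1 - \rho)\beta^0 + \rho\beta^1$, $\rho \in ]0, 1[$. Strict convexity of
$u \mapsto \|y - u\|^2$ implies that the Jensen inequality is strict, i.e.

$\frac{1}{2}\|y - X\beta^\rho\|^2 < \frac{1 - \rho}{2}\|y - X\beta^0\|^2 + \frac{\rho}{2}\|y - X\beta^1\|^2$.

Denote the $\ell^1 - \ell^2$ norm $\|\beta\|_{\mathcal{B}} = \sum_{b \in \mathcal{B}} \|\beta_b\|$. Jensen's inequality applied to
$\|\cdot\|_{\mathcal{B}}$ gives

$\|\beta^\rho\|_{\mathcal{B}} \leq (1 - \rho)\|\beta^0\|_{\mathcal{B}} + \rho\|\beta^1\|_{\mathcal{B}}$.

Summing these two inequalities, and using that $\beta^0$ and $\beta^1$ attain the same
minimal value, we arrive at $\frac{1}{2}\|y - X\beta^\rho\|^2 + \lambda\|\beta^\rho\|_{\mathcal{B}} < \frac{1}{2}\|y -
X\beta^0\|^2 + \lambda\|\beta^0\|_{\mathcal{B}}$, a contradiction since $\beta^0$ is a minimizer of ($\mathcal{P}_\lambda(y)$).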
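A tiny example shows the situation covered by Lemma 2: two distinct minimizers with the same image under $X$. The design with two identical columns, the value $\lambda = 1$ and the two vectors below are chosen by hand for this illustration; the minimal value $1.5$ follows from minimizing $\frac{1}{2}(2 - s)^2 + |s|$ over $s = \beta_1 + \beta_2$.

```python
import numpy as np

X = np.array([[1.0, 1.0]])       # n = 1, p = 2, two identical columns
y = np.array([2.0])
blocks = [[0], [1]]              # blocks of size 1: the Lasso case
lam = 1.0

def objective(beta):
    """0.5 * ||y - X beta||^2 + lam * sum_b ||beta_b||."""
    penalty = sum(np.linalg.norm(beta[b]) for b in blocks)
    return 0.5 * np.sum((y - X @ beta) ** 2) + lam * penalty

beta_a = np.array([1.0, 0.0])    # one minimizer
beta_b = np.array([0.5, 0.5])    # another minimizer with a larger support
same_image = np.allclose(X @ beta_a, X @ beta_b)
```

Both vectors reach the minimal objective value and give the same prediction, although their supports differ.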



4.2 Proof of Theorem 1

We first need the following lemma.

Lemma 3 Let $\beta \in \mathbb{R}^p$ and $\lambda > 0$. Assume that ($A(\beta)$) holds for $I$ the
$\mathcal{B}$-support of $\beta$. Then $X_I^T X_I + \lambda\, \delta_\beta \circ P_\beta$ is invertible.

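Proof We prove that $X_I^T X_I + \lambda\, \delta_\beta \circ P_\beta$ is actually symmetric definite positive.
First observe that $X_I^T X_I$ and $\delta_\beta \circ P_\beta$ are both symmetric semidefinite
positive. Indeed, $\delta_\beta$ is diagonal (with strictly positive diagonal entries), and $P_\beta$ is
symmetric since it is a block-wise orthogonal projector, and we have

$\langle x, \delta_\beta \circ P_\beta(x) \rangle = \sum_{b \in I} \frac{\|\mathrm{Proj}_{\beta_b^\perp}(x_b)\|^2}{\|\beta_b\|} \geq 0$.

The inequality becomes an equality if and only if $x = \beta_I$, i.e. $\mathrm{Ker}\, \delta_\beta \circ P_\beta =
\{\beta_I\}$.

It remains to show that $\mathrm{Ker}\, X_I^T X_I \cap \mathrm{Ker}\, \delta_\beta \circ P_\beta = \{0\}$. Suppose that
$\beta_I \in \mathrm{Ker}\, X_I^T X_I$. This is equivalent to $\beta_I \in \mathrm{Ker}\, X_I$ since
$\langle \beta_I, X_I^T X_I \beta_I \rangle = \|X_I\beta_I\|^2$.
But this would mean that

$X_I\beta_I = \sum_{b \in I} X_b\beta_b = 0$

which is in contradiction with the linear independence assumption ($A(\beta)$). □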

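Let $y \notin \mathcal{H}$. We define $I = \mathrm{supp}_{\mathcal{B}}(\hat\beta(y))$ the $\mathcal{B}$-support of a solution $\hat\beta(y)$ of
($\mathcal{P}_\lambda(y)$). We define the following mapping

$\Gamma(\beta_I, y) = X_I^T(X_I\beta_I - y) + \lambda N(\beta_I)$.

Observe that the first statement of Lemma 1 is equivalent to $\Gamma(\hat\beta_I(y), y) = 0$.
Any $\beta_I \in \mathbb{R}^{I,*}$ such that $\Gamma(\beta_I, y) = 0$ is solution of the problem

$\min_{\beta_I \in \mathbb{R}^{|I|}} \frac{1}{2}\|y - X_I\beta_I\|^2 + \lambda \sum_{b \in I} \|\beta_b\|$. ($\mathcal{P}_\lambda(y)_I$)

Our proof will be split in three steps. We first prove the first statement by
showing that there exists a mapping $\bar y \mapsto \tilde\beta(\bar y)$ and an open neighborhood $\mathcal{O}$
of $y$ such that every element $\bar y$ of $\mathcal{O}$ satisfies $\Gamma(\tilde\beta_I(\bar y), \bar y) = 0$ and $\tilde\beta_{I^c}(\bar y) = 0$.
Then, we prove the second assertion that $\tilde\beta(\bar y)$ is a solution of ($\mathcal{P}_\lambda(\bar y)$) for
$\bar y \in \mathcal{O}$. Finally, we obtain (2) from the implicit function theorem.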

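1. The Jacobian of $\Gamma$ with respect to the first variable reads on $\mathbb{R}^{I,*} \times \mathbb{R}^n$

$\partial_1 \Gamma(\beta_I, y) = X_I^T X_I + \lambda\, \delta_{\beta_I} \circ P_{\beta_I}$.

The mapping $\partial_1 \Gamma$ is invertible according to Lemma 3. Hence, using the
implicit function theorem, there exists a neighborhood $\tilde{\mathcal{O}}$ of $y$ such that we
can define a mapping $\tilde\beta_I : \tilde{\mathcal{O}} \to \mathbb{R}^{|I|}$ which is $C^1(\tilde{\mathcal{O}})$, and satisfies for $\bar y \in \tilde{\mathcal{O}}$

$\Gamma(\tilde\beta_I(\bar y), \bar y) = 0$ and $\tilde\beta_I(y) = \hat\beta_I(y)$.

We then extend $\tilde\beta_I$ on $I^c$ as $\tilde\beta_{I^c}(\bar y) = 0$, which defines a continuous mapping
$\tilde\beta : \tilde{\mathcal{O}} \to \mathbb{R}^p$.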

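2. From the second minimality condition of Lemma 1, we have

$\forall b \notin I, \; \|X_b^T(y - X_I\hat\beta_I(y))\| \leq \lambda$.
We define the two following sets

$J_{\mathrm{sat}} = \{b \notin I : \|X_b^T(y - X_I\hat\beta_I(y))\| = \lambda\}$,

$J_{\mathrm{nosat}} = \{b \notin I : \|X_b^T(y - X_I\hat\beta_I(y))\| < \lambda\}$,

which form a disjoint union $I^c = J_{\mathrm{sat}} \cup J_{\mathrm{nosat}}$.
a) By continuity of $\bar y \mapsto \tilde\beta_I(\bar y)$ and since $\tilde\beta_I(y) = \hat\beta_I(y)$, we can find a
neighborhood $\mathcal{O}$ of $y$ included in $\tilde{\mathcal{O}}$ such that

$\forall \bar y \in \mathcal{O}, \; \forall b \in J_{\mathrm{nosat}}, \; \|X_b^T(\bar y - X_I\tilde\beta_I(\bar y))\| < \lambda$.

b) Consider now a block $b \in J_{\mathrm{sat}}$. Observe that the vector $(y, \hat\beta_I(y), N(\hat\beta_I(y)))$
is an element of $\mathcal{A}_{I,b}$. In particular $y \in \pi(\mathcal{A}_{I,b})$. Since by assumption
$y \notin \mathcal{H}$, one has $y \notin \mathrm{bd}(\pi(\mathcal{A}_{I,b}))$. Hence, there exists an open ball $\mathbb{B}(y, \epsilon)$
for some $\epsilon > 0$ such that $\mathbb{B}(y, \epsilon) \subset \pi(\mathcal{A}_{I,b})$. Notice that every element
$\bar y \in \mathbb{B}(y, \epsilon)$ is such that there exists $(\beta_I, v_I) \in \mathbb{R}^{I,*} \times \mathbb{R}^{I,*}$ with

$\|X_b^T(\bar y - X_I\beta_I)\| = \lambda$
$X_I^T(X_I\beta_I - \bar y) + \lambda v_I = 0$
$v_I = N(\beta_I)$.

Using a similar argument as in the proof of Lemma 2, it is easy to see
that all solutions of ($\mathcal{P}_\lambda(\bar y)_I$) share the same image under $X_I$. Thus the
vector $(\bar y, \tilde\beta_I(\bar y), N(\tilde\beta_I(\bar y)))$ is an element of $\mathcal{A}_{I,b}$, and we conclude that

$\forall \bar y \in \mathbb{B}(y, \epsilon), \; f_{I,b}(\bar y) = \|X_b^T(\bar y - X_I\tilde\beta_I(\bar y))\| = \lambda$.

Hence, $f_{I,b}$ is locally constant around $y$ on an open ball $\mathcal{O}$.
Moreover, by definition of the mapping $\tilde\beta_I$, one has for all $\bar y \in \mathcal{O} \cap \tilde{\mathcal{O}}$

$X_I^T(\bar y - X_I\tilde\beta_I(\bar y)) = \lambda N(\tilde\beta_I(\bar y))$ and $\mathrm{supp}_{\mathcal{B}}(\tilde\beta(\bar y)) = I$.

According to Lemma 1, the vector $\tilde\beta(\bar y)$ is a solution of ($\mathcal{P}_\lambda(\bar y)$).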
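3. By virtue of statement 1., we are in position to use the implicit function
theorem, and we get the Jacobian of $\tilde\beta_I$ as

$\partial_y \tilde\beta_I(\bar y) = -(\partial_1 \Gamma(\tilde\beta_I(\bar y), \bar y))^{-1} (\partial_2 \Gamma(\tilde\beta_I(\bar y), \bar y))$

where $\partial_2 \Gamma(\tilde\beta_I(\bar y), \bar y) = -X_I^T$, which leads us to (2).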



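4.3 Proof of Theorem 2
We define the set

$\mathcal{A}_I = \left\{\beta_I \in \mathbb{R}^{|I|} : \forall \mu \in \mathbb{R}^{\#I}, \; \sum_{i=1}^{\#I} \mu_i X_{b_i}\beta_{b_i} = 0 \implies \mu = 0\right\}$, (7)

where $\#I$ is the number of blocks in $I$, and $b_i \in I$ is the $i$-th block in $I$. It is
easy to see that $\beta^\star_I \in \mathcal{A}_I$ for $I$ the $\mathcal{B}$-support of $\beta^\star$ if and only if ($A(\beta^\star)$) holds.

The following lemma proves that there exists a solution $\beta^\star$ of ($\mathcal{P}_\lambda(y)$) such
that ($A(\beta^\star)$) holds. A similar result with a different proof can be found in
(Liu and Zhang 2009).

Lemma 4 There exists a solution $\beta^\star$ of ($\mathcal{P}_\lambda(y)$) such that $\beta^\star_I \in \mathcal{A}_I$ where
$I = \mathrm{supp}_{\mathcal{B}}(\beta^\star)$.

Proof Let $\beta^0$ be a solution of ($\mathcal{P}_\lambda(y)$) and $I = \mathrm{supp}_{\mathcal{B}}(\beta^0)$ such that $\beta^0_I \notin \mathcal{A}_I$.
There exists $\mu \in \mathbb{R}^{\#I}$, $\mu \neq 0$, such that

$\sum_{i=1}^{\#I} \mu_i X_{b_i}\beta^0_{b_i} = 0$. (8)

Consider now the family $t \mapsto \beta^t$ defined for every $t \in \mathbb{R}$

$\forall b_i \in I, \; \beta^t_{b_i} = (1 + t\mu_i)\beta^0_{b_i}$ and $\beta^t_{I^c} = 0$. (9)

Consider $t_0 = \min\{|t| \in \mathbb{R} : \exists b_i \in I$ such that $1 + t\mu_i = 0\}$. Without loss of
generality, we assume that $t_0 > 0$. Remark that for all $t \in [0, t_0)$, $\beta^t$ is a
solution of ($\mathcal{P}_\lambda(y)$). Indeed, $I$ is the $\mathcal{B}$-support of $\beta^t$ and

$X_I\beta^t_I = X_I\beta^0_I + t \sum_{i=1}^{\#I} \mu_i X_{b_i}\beta^0_{b_i} = X_I\beta^0_I$, (10)

where the sum vanishes using (8).

Hence,

$X_I^T(y - X_I\beta^t_I) = X_I^T(y - X_I\beta^0_I) = \lambda N(\beta^0_I) = \lambda N(\beta^t_I)$,

and

$\|X_b^T(y - X_I\beta^t_I)\| = \|X_b^T(y - X_I\beta^0_I)\| \leq \lambda, \; \forall b \in I^c$.
Since the image of all solutions of ($\mathcal{P}_\lambda(y)$) are equal under $X$, one has

$X\beta^t = X\beta^0$ and $\|\beta^t\|_{\mathcal{B}} = \|\beta^0\|_{\mathcal{B}}$,

where $\|\cdot\|_{\mathcal{B}}$ is the $\ell^1 - \ell^2$ norm. Consider now the vector $\beta^{t_0}$. By continuity of
$\beta \mapsto X\beta$ and $\beta \mapsto \|\beta\|_{\mathcal{B}}$, one has

$X\beta^{t_0} = X\beta^0$ and $\|\beta^{t_0}\|_{\mathcal{B}} = \|\beta^0\|_{\mathcal{B}}$.

Hence, $\beta^{t_0}$ has a $\mathcal{B}$-support $I_{t_0}$ strictly included in $I$ (in the sense that for all
$b \in I_{t_0}$ one has $b \in I$) and is a solution of ($\mathcal{P}_\lambda(y)$). Iterating this argument
with $\beta^0 = \beta^{t_0}$ shows that there exists a solution $\beta^\star$ such that $\beta^\star_{\mathrm{supp}_{\mathcal{B}}(\beta^\star)} \in \mathcal{A}_{\mathrm{supp}_{\mathcal{B}}(\beta^\star)}$.
This concludes the proof of the lemma. □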
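The construction used in this proof can be mimicked numerically. The sketch below performs one pass: it looks for a vector $\mu$ in the null space of the matrix with columns $X_{b_i}\beta_{b_i}$ and moves along $t \mapsto \beta^t$ until one block vanishes. The function name `reduce_support_once`, the use of an SVD to find $\mu$ and the tolerance are assumptions of this example; unlike the proof, the sketch lets $t$ take either sign instead of flipping $\mu$.

```python
import numpy as np

def reduce_support_once(X, beta, blocks, tol=1e-10):
    """One pass of the support reduction; returns a copy of beta if (A(beta)) already holds."""
    support = [b for b in blocks if np.linalg.norm(beta[b]) > 0]
    if not support:
        return beta.copy()
    V = np.column_stack([X[:, b] @ beta[b] for b in support])   # columns X_b beta_b
    _, s, Vt = np.linalg.svd(V)
    if np.sum(s > tol) == len(support):
        return beta.copy()                 # the vectors X_b beta_b are linearly independent
    mu = Vt[-1]                            # null vector: sum_i mu_i X_{b_i} beta_{b_i} = 0
    i_star = int(np.argmax(np.abs(mu)))
    t = -1.0 / mu[i_star]                  # value of smallest modulus with 1 + t * mu_i = 0
    new_beta = np.zeros_like(beta)
    for i, b in enumerate(support):
        if i != i_star:
            new_beta[b] = (1.0 + t * mu[i]) * beta[b]   # factor is nonnegative by choice of i_star
    return new_beta

X = np.array([[1.0, 1.0]])
blocks = [[0], [1]]
beta0 = np.array([0.5, 0.5])               # a solution for y = 2, lam = 1 violating (A(beta))
beta1 = reduce_support_once(X, beta0, blocks)
n_active = sum(np.linalg.norm(beta1[b]) > 0 for b in blocks)
```

On this example the pass removes one of the two blocks while leaving both $X\beta$ and the $\ell^1 - \ell^2$ norm unchanged, so the result is again a solution and now satisfies the assumption.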

According to Theorem 1, $y \mapsto \hat\beta(y)$ is $C^1(\mathbb{R}^n \setminus \mathcal{H})$. This property is preserved
under the linear mapping $X$ which shows that $\hat\mu$ is also $C^1(\mathbb{R}^n \setminus \mathcal{H})$. Thus, taking
the trace of the Jacobian $X_I\, d(y, \lambda)$ gives the divergence formula (5) for any
solution $\hat\beta(y)$ such that ($A(\hat\beta(y))$) holds.

4.4 Proof of Theorem 3

The next lemma shows that the transition space has zero measure.

Lemma 5 Let $\lambda > 0$. The transition space $\mathcal{H}$ is of zero measure with respect
to the Lebesgue measure of $\mathbb{R}^n$.

Proof We obtain this result by proving that all $\mathcal{H}_{I,b}$ are of zero measure for
all $I$ and $b \notin I$, and that the union is over a finite set.

We recall from (Coste 2002) that any semialgebraic set $S \subset \mathbb{R}^n$ can be
decomposed in a disjoint union of $q$ semialgebraic subsets $C_i$ each diffeomorphic
to $(0, 1)^{d_i}$. The dimension of $S$ is thus

$d = \max_{i \in \{1, \ldots, q\}} d_i \leq n$.

The set $\mathcal{A}_{I,b}$ is an algebraic, hence a semialgebraic, set. By the fundamental
Tarski-Seidenberg principle, the canonical projection $\pi(\mathcal{A}_{I,b})$ is also
semialgebraic. The boundary $\mathrm{bd}(\pi(\mathcal{A}_{I,b}))$ is also semialgebraic with a strictly smaller
dimension than $\pi(\mathcal{A}_{I,b})$

$\dim \mathcal{H}_{I,b} = \dim \mathrm{bd}(\pi(\mathcal{A}_{I,b})) < \dim \pi(\mathcal{A}_{I,b}) \leq n$

whence we deduce that $\mathcal{H}$ is of zero measure with respect to the Lebesgue
measure on $\mathbb{R}^n$. □



As $\hat\mu$ is uniformly Lipschitz over $\mathbb{R}^n$, using similar arguments as in (Meyer and Woodroofe
2000), we get that $\hat\mu$ is weakly differentiable with an essentially bounded
gradient. Moreover, the divergence formula (5) holds valid almost everywhere,
except on the set $\mathcal{H}$ which is of Lebesgue measure zero. We conclude by
invoking Stein's lemma (Stein 1981) to establish unbiasedness of the estimator
$\hat{df}$ of the DOF.

Plugging the DOF expression into that of the SURE (Stein 1981, Theorem 1),
we get (6).


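4.5 Proof of Corollary 1

When $X = \mathrm{Id}_n$, we have $X_I^T X_I = \mathrm{Id}_I$, which in turn implies that $\mathrm{Id}_I +
\lambda\, \delta_{\hat\beta(y)} \circ P_{\hat\beta(y)}$ is block-diagonal. Thus, specializing the divergence formula of
Theorem 2 to $X = \mathrm{Id}_n$ yields

$\hat{df} = \mathrm{tr}\left((\mathrm{Id}_I + \lambda\, \delta_{\hat\beta(y)} \circ P_{\hat\beta(y)})^{-1}\right)
= \sum_{b \in I} \mathrm{tr}\left(\left(\mathrm{Id}_b + \frac{\lambda}{\|\hat\beta_b(y)\|} \mathrm{Proj}_{\hat\beta_b(y)^\perp}\right)^{-1}\right)
= \sum_{b \in I} \left(1 + \frac{|b| - 1}{1 + \frac{\lambda}{\|\hat\beta_b(y)\|}}\right)$,

where the last equality follows from the fact that $\mathrm{Proj}_{\hat\beta_b(y)^\perp} = \mathrm{Id}_b - \frac{\hat\beta_b(y)\hat\beta_b(y)^T}{\|\hat\beta_b(y)\|^2}$
is the orthogonal projector on a subspace of dimension $|b| - 1$.

Furthermore, for $X = \mathrm{Id}_n$, $\hat\beta_b(y)$ has a closed form given by block soft
thresholding

$\hat\beta_b(y) = 0$ if $\|y_b\| \leq \lambda$, and $\hat\beta_b(y) = \left(1 - \frac{\lambda}{\|y_b\|}\right) y_b$ otherwise. (11)

It then follows that

$\frac{1}{1 + \frac{\lambda}{\|\hat\beta_b(y)\|}} = 1 - \frac{\lambda}{\|y_b\|}$.

Piecing everything together, we obtain

$\hat{df} = \sum_{b \in I} \left(1 + (|b| - 1)\left(1 - \frac{\lambda}{\|y_b\|}\right)\right) = \sum_{b \in I} |b| - \lambda \sum_{b \in I} \frac{|b| - 1}{\|y_b\|}$.

As $|I| = \sum_{b \in I} |b|$, we get the desired result. Note that this result can be
obtained directly by differentiating (11).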

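4.6 Proof of Proposition 1

Let us introduce the shorthand notation for the reliability
$R = \mathbb{E}_\varepsilon\left[(\mathrm{SURE}(\hat\mu(y)) - \mathrm{SE}(\hat\mu(y)))^2\right]$.
Applying (Vaiter et al 2012b, Theorem 4), we get

$R = 2n\sigma^4 - 4\sigma^4\, \mathbb{E}_\varepsilon\left(\mathrm{tr}\left(X_I B(y, \lambda) X_I^T (2\mathrm{Id}_n - X_I B(y, \lambda) X_I^T)\right)\right) + 4\sigma^2\, \mathbb{E}_\varepsilon(\|\hat\mu(y) - \mu_0\|^2)$

where $B(y, \lambda) = (X_I^T X_I + \lambda\, \delta_{\hat\beta(y)} \circ P_{\hat\beta(y)})^{-1}$ is positive definite by Lemma 3.

Let us bound the last term. By Jensen's inequality and the fact that $\hat\beta(y)$ is
a (global) minimizer of ($\mathcal{P}_\lambda(y)$), we have

$\mathbb{E}_\varepsilon(\|\hat\mu(y) - \mu_0\|^2) \leq 2\left(\mathbb{E}_\varepsilon(\|y - \hat\mu(y)\|^2) + \mathbb{E}_\varepsilon(\|y - \mu_0\|^2)\right)$

$\leq 4\, \mathbb{E}_\varepsilon\left(\frac{1}{2}\|y - \hat\mu(y)\|^2 + \lambda \sum_{b \in I} \|\hat\beta_b(y)\|\right) + 2n\sigma^2$

$\leq 4\, \mathbb{E}_\varepsilon\left(\frac{1}{2}\|y\|^2\right) + 2n\sigma^2 = 2\|\mu_0\|^2 + 4n\sigma^2$.

Let us turn to the second term. We have

$\mathrm{tr}\left(X_I B(y, \lambda) X_I^T X_I B(y, \lambda) X_I^T\right) = \mathrm{tr}\left(X_I^T X_I B(y, \lambda) X_I^T X_I B(y, \lambda)\right)$

$= \|X_I^T X_I B(y, \lambda)\|_F^2 \leq n \|X_I^T X_I B(y, \lambda)\|^2$.

In addition, $X_I B(y, \lambda) X_I^T$ is semidefinite positive and therefore

$R \leq 2n\sigma^4 + 4\sigma^4\, \mathbb{E}_\varepsilon\left(\mathrm{tr}\left(X_I B(y, \lambda) X_I^T X_I B(y, \lambda) X_I^T\right)\right) + 4\sigma^2\, \mathbb{E}_\varepsilon(\|\hat\mu(y) - \mu_0\|^2)$

$\leq 2n\sigma^4 + 4n\sigma^4\, \mathbb{E}_\varepsilon(\|X_I^T X_I B(y, \lambda)\|^2) + 16n\sigma^4 + 8\sigma^2\|\mu_0\|^2$,

whence we get the desired bound after dividing both sides by $n^2\sigma^4$.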






Samuel Vaiter et al. 



References 

Akaike H (1973) Information theory and an extension of the maximum likelihood principle. In: Second international symposium on information theory, Springer Verlag, vol 1, pp 267-281
Bach F (2008) Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research 9:1179-1225
Bakin S (1999) Adaptive regression and model selection in data mining problems. Thesis (Ph.D.)-Australian National University, 1999
Bickel PJ, Ritov Y, Tsybakov A (2009) Simultaneous analysis of lasso and Dantzig selector. Annals of Statistics 37:1705-1732
Bühlmann P, van de Geer S (2011) Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer
Candès E, Plan Y (2009) Near-ideal model selection by $\ell_1$ minimization. Annals of Statistics 37(5A):2145-2177
Coste M (2002) An introduction to semialgebraic geometry. Tech. rep., Institut de Recherche Mathématiques de Rennes
Donoho D (2006) For most large underdetermined systems of linear equations the minimal $\ell^1$-norm solution is also the sparsest solution. Communications on pure and applied mathematics 59(6):797-829
Dossal C, Kachour M, Fadili J, Peyré G, Chesneau C (2012) The degrees of freedom of penalized $\ell_1$ minimization. To appear in Statistica Sinica. URL http://hal.archives-ouvertes.fr/hal-00638417
Efron B (1986) How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association 81(394):461-470
Kato K (2009) On the degrees of freedom in shrinkage estimation. Journal of Multivariate Analysis 100(7):1338-1352
Liu H, Zhang J (2009) Estimation consistency of the group lasso and its applications. Journal of Machine Learning Research 5:376-383
Mallows CL (1973) Some comments on $C_p$. Technometrics 15(4):661-675
Meyer M, Woodroofe M (2000) On the degrees of freedom in shape-restricted regression. Annals of Statistics 28(4):1083-1104
Osborne M, Presnell B, Turlach B (2000) A new approach to variable selection in least squares problems. IMA journal of numerical analysis 20(3):389
Solo V, Ulfarsson M (2010) Threshold selection for group sparsity. In: Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, IEEE, pp 3754-3757
Stein C (1981) Estimation of the mean of a multivariate normal distribution. The Annals of Statistics 9(6):1135-1151
Tibshirani R (1996) Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B Methodological 58(1):267-288
Tibshirani RJ, Taylor J (2012) Degrees of freedom in Lasso problems. Tech. rep., arXiv:1111.0653
Tikhonov AN, Arsenin VY (1997) Solutions of Ill-posed Problems. V. H. Winston and Sons
Vaiter S, Deledalle C, Peyré G, Fadili J, Dossal C (2012a) Degrees of freedom of the group Lasso. In: ICML'12 Workshops, pp 89-92
Vaiter S, Deledalle C, Peyré G, Fadili J, Dossal C (2012b) Local behavior of sparse analysis regularization: Applications to risk estimation. To appear in Applied and Computational Harmonic Analysis. URL http://hal.archives-ouvertes.fr/hal-00687751/
Wei F, Huang J (2010) Consistent group selection in high-dimensional linear regression. Bernoulli 16(4):1369-1384
Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J of The Roy Stat Soc B 68(1):49-67
Zou H, Hastie T, Tibshirani R (2007) On the "degrees of freedom" of the Lasso. The Annals of Statistics 35(5):2173-2192



